Author: Ijeoma Nwachukwu
Date: 2025-08-26
The Biostatistics and Health Data Science Group is a multidisciplinary academic research and teaching group under the IAHS, characterised by collaborative research, consultancy and training across clinical, biological and global health domains. In the global health domain, to which I was assigned, the data used for research and training are collected from a number of secure sources, including the DHS Program.
The DHS Program, funded by USAID, collects nationally representative global health data to monitor and evaluate population, health, and nutrition programs, and provides data to track approximately 30 SDG indicators. It supplies both the data and the measures needed to track these indicators, contributing significantly towards achieving SDGs 3 and 5 (The DHS Program, 2025).
However, the DHS Program has been suspended and is currently under review for further funding. During this review, new registrations are not being accepted, which restricts access to datasets commonly used by undergraduate and postgraduate students for their theses and training, especially in LMICs. This significantly hampers the training of future national and global health leaders, in addition to other far-reaching effects.
My project focused on collecting, organizing, merging and analyzing datasets from the DHS Program that are relevant to our global health projects. While this mitigates the effect of the recent suspension for students and researchers within the team working on global health projects, it also gave me an opportunity to familiarize myself with global health data and to perform exploratory data analysis on aspects of gender inequality, including Female Genital Mutilation (FGM), Intimate Partner Violence (IPV) and Autonomy of Health Care Decision Making (AHCDM), which are often intertwined and are prevalent issues for women of childbearing age in LMICs (Wessells & Kostelny, 2022).
This project achieved two aims
I accessed the data from the DHS Program website using my supervisor’s login. Exploratory data analysis was done using datasets from the IPUMS-DHS website, which provides Demographic and Health Surveys (DHS) data harmonized across countries and over time. The data are free to use for research and teaching purposes; however, users must register for an account and agree to the terms of use.
To access datasets, new users must register for an account on the DHS Program website and apply for datasets on the IPUMS-DHS website.
The project was carried out in three phases:
1. Autodownload of DHS Datasets
2. DHS IR File Merge (Pilot merge)
3. Exploratory Data Analysis of Gender Inequalities using IPUMS-DHS Data
The project was documented throughout, with clear instructions and explanations of the code, so that the workflow and the analysis results are transparent and reproducible.
A structured, reproducible workflow was scripted in R Markdown. It
serves as a comprehensive toolkit for accessing, processing, and
locally managing DHS downloads, enabling seamless data retrieval
for collaborative research in support of global health studies. It
ensures secure data access, automates downloads, and systematically
unzips, organizes and saves the datasets in a hierarchical file
structure: FileName/CountryName/SurveyYear/DataType. The
workflow is specifically for DHS datasets in SPSS and STATA formats, as
specified in my project tasks.
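The folder layout described above can be sketched in a few lines of base R. This is only an illustration, not the project's download script: the helper names `dhs_target_dir()` and `organise_dhs_zip()`, the example archive name, and the use of `tempdir()` as the root are all assumptions made for the example.

```r
# Hypothetical helper: build the FileName/CountryName/SurveyYear/DataType path
dhs_target_dir <- function(root, file_name, country, survey_year, data_type) {
  file.path(root, file_name, country, as.character(survey_year), data_type)
}

# Hypothetical helper: create the target folder and unzip one archive into it.
# The archive is only unzipped if it actually exists on disk.
organise_dhs_zip <- function(zip_path, root, country, survey_year, data_type) {
  file_name <- tools::file_path_sans_ext(basename(zip_path))
  target <- dhs_target_dir(root, file_name, country, survey_year, data_type)
  dir.create(target, recursive = TRUE, showWarnings = FALSE)
  if (file.exists(zip_path)) {
    utils::unzip(zip_path, exdir = target)
  }
  target
}

# Illustrative call; the archive name and root folder are placeholders
example_dir <- organise_dhs_zip("KEIR8CSV.ZIP", tempdir(), "Kenya", 2022, "SPSS")
```

Keeping the path construction in its own function makes it easy to check the layout without downloading anything.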
A structured, reproducible workflow was developed to merge DHS
Individual Recode (IR) datasets for two countries (Kenya and Tanzania,
2022) using SPSS syntax. A cross-country unique identifier,
UCASEID, was created by concatenating the country code variable (V000)
and the case ID (CASEID). Subsets containing the UCASEID and relevant IPV
variables were saved and merged using SPSS commands. This workflow can
be adapted for additional countries and survey rounds, and replicated
for different variables, provided that the variable names, labels, and
meanings are first confirmed to be consistent according to the DHS
Recode Manual (The DHS Program, 2025). The workflow syntax is given in Appendix 1 (set to do-not-run).
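For readers working in R rather than SPSS, the same merge logic can be sketched as below. The project itself used SPSS; this is an assumed R equivalent on toy data, and the data frames `ke` and `tz`, their values, and the helper `add_ucaseid()` are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for the two IR subsets; values are illustrative, not survey data.
# The same CASEID appears in both countries, which is why V000 is prefixed.
ke <- data.frame(V000 = "KE8", CASEID = c("1 1 2", "1 2 1"), stringsAsFactors = FALSE)
tz <- data.frame(V000 = "TZ8", CASEID = c("1 1 2", "1 3 1"), stringsAsFactors = FALSE)

# Hypothetical helper: build the cross-country identifier from V000 and CASEID
add_ucaseid <- function(df) {
  df %>% mutate(UCASEID = paste0(V000, CASEID))
}

# Stack the two subsets row-wise (the counterpart of ADD FILES in SPSS)
merged <- bind_rows(add_ucaseid(ke), add_ucaseid(tz))

# The identifier should be unique across the merged file
stopifnot(!any(duplicated(merged$UCASEID)))
```

Checking for duplicated identifiers straight after the merge is a cheap safeguard before any analysis is run on the combined file.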
The data were explored using SPSS cross-tabulations of the variables and R Plotly visualizations of the SPSS outputs for the following:
1. IPV: percentage of women slapped in last 12 months (frequency), variable code = DVPSLAPFQ
2. FGM: percentage of ever circumcised women within country, variable code = FCCIRC
3. AHCDM: percentage of women who have the final say on their health care within country, variable code = DECFEMHCARE
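The "% within country" figures from a cross-tabulation are row percentages: each response count divided by that country's total. A minimal sketch of the calculation in R, on toy data, is shown here. It is not the SPSS procedure used in the project, it ignores sampling weights, and the data frame `toy` and its values are invented for illustration.

```r
library(dplyr)

# Toy respondent-level data; values are illustrative only
toy <- data.frame(
  COUNTRY = c("A", "A", "A", "A", "B", "B"),
  DVPSLAPFQ = c(
    "Not ever slapped", "Not ever slapped",
    "Often during last 12 months", "Not ever slapped",
    "Not ever slapped", "Sometimes during last 12 months"
  )
)

# Count each response within country, then convert counts to row percentages
row_pct <- toy %>%
  count(COUNTRY, DVPSLAPFQ, name = "n") %>%
  group_by(COUNTRY) %>%
  mutate(Percent = 100 * n / sum(n)) %>%
  ungroup()

row_pct
```

Each country's percentages sum to 100, which is a quick check to run on tables exported from SPSS as well.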
This plot presents the distribution of women’s reported experiences with intimate partner violence (IPV) across countries. The response categories include: “Not ever slapped,” “Often during last 12 months,” “Sometimes during last 12 months,” “Not at all in last 12 months,” and “Yes, timing and frequency unknown.” Most countries show that a significant proportion of women have never been slapped by an intimate partner, but in many settings, notable percentages report being slapped at least sometimes or often within the past year. Variability across countries is visible, with some (e.g., Sao Tome and Principe, Zimbabwe) having higher frequencies of violence, and others (e.g., India, Senegal) showing larger shares of respondents reporting no experience of IPV.
library(plotly)
library(dplyr)
library(readxl)
library(tidyr)
library(forcats)
# Read data
df_ipv <- read_excel("percentage of women slapped in last 12 month (frequency).xlsx", sheet = "sheet1")
response_ipv <- c(
  "Not ever slapped",
  "Often during last 12 months",
  "Sometimes during last 12 months",
  "Not at all in last 12 months",
  "Yes, timing and frequency unknown"
)
# Multiply the response columns by 100
df_ipv <- df_ipv %>%
  mutate(across(all_of(response_ipv), ~ .x * 100))
# Reshape to long format for plotting
df_ipv_long <- df_ipv %>%
  select(country, all_of(response_ipv)) %>%
  pivot_longer(
    cols = -country,
    names_to = "Response",
    values_to = "Percent"
  ) %>%
  mutate(Response = gsub(" %", "", Response)) # Clean up response label
# Arrange bars in ascending order of total percent
ipvcountry_order <- df_ipv_long %>%
  group_by(country) %>%
  summarize(total_percent = sum(Percent, na.rm = TRUE)) %>%
  arrange(total_percent) %>%
  pull(country)
# Set country factor levels according to ascending total percent
df_ipv_long <- df_ipv_long %>%
  mutate(country = factor(country, levels = ipvcountry_order))
# Generate interactive plot using Plotly
fig1_ipv <- plot_ly(
  df_ipv_long,
  y = ~country,
  x = ~Percent,
  color = ~Response,
  type = "bar",
  orientation = "h"
) %>%
  layout(
    barmode = "stack",
    title = "Percentage of Women Ever Slapped by an Intimate Partner within the Last 12 Months",
    xaxis = list(title = " "),
    yaxis = list(title = " "),
    legend = list(title = list(text = "Response Category"))
  )
fig1_ipv
This plot shows the percentage of women within each country who report having experienced female genital mutilation/cutting (FGM/C), with responses categorized as “yes,” “no,” and “don’t know.” There is wide country variation: nations like Guinea, Sierra Leone, Mali, Gambia, and Egypt show extremely high percentages of women reporting being circumcised (often over 80%), while countries such as Ghana, Cameroon, Tanzania, and others report relatively low percentages. The “don’t know” response is almost negligible in most contexts, indicating good awareness or clear reporting. The significant country-to-country differences reflect varying cultural, legal, and historical norms about FGM/C practices.
# Read data
df_fgm <- read_excel("percentage of ever circumcised women within country.xlsx", sheet = "sheet1")
response_cols <- c(
  "no",
  "yes",
  "don't know"
)
# Reshape to long format for plotting
df_fgm_long <- df_fgm %>%
  select(country, all_of(response_cols)) %>%
  pivot_longer(
    cols = -country,
    names_to = "Response",
    values_to = "Percent"
  ) %>%
  mutate(Response = gsub(" %", "", Response)) # Clean up response label
# Arrange bars in ascending order of total percent
fgmcountry_order <- df_fgm_long %>%
  group_by(country) %>%
  summarize(total_percent = sum(Percent, na.rm = TRUE)) %>%
  arrange(total_percent) %>%
  pull(country)
# Set country factor levels according to ascending total percent
df_fgm_long <- df_fgm_long %>%
  mutate(country = factor(country, levels = fgmcountry_order))
# Generate interactive plot using Plotly
fig1_fgm <- plot_ly(
  df_fgm_long,
  y = ~country,
  x = ~Percent,
  color = ~Response,
  type = "bar",
  orientation = "h"
) %>%
  layout(
    barmode = "stack",
    title = "Percentage of Women Ever Circumcised",
    xaxis = list(title = " "),
    yaxis = list(title = " "),
    legend = list(title = list(text = "Response Category"))
  )
fig1_fgm
The chart explores women’s reported autonomy and roles in health care decision-making. The response categories include “Woman alone,” “Woman and husband/partner,” “Woman and someone else,” “Husband/partner,” “Someone else,” and “Family elders/relatives.” In many countries, the largest proportion of women say decisions are made “with their husband/partner” or by their “husband/partner” alone, reflecting persistent gender norms around health autonomy. However, countries such as Mozambique, Lesotho, and Madagascar display higher shares for “Woman alone,” indicating stronger female decision-making autonomy. “Woman and someone else” and “Family elders/relatives” are minor categories in most contexts, suggesting these are less common arrangements for household health decisions. Country patterns reflect diverse sociocultural structures and levels of empowerment.
# Read data
dfahcdm <- read_excel("percentage of women who have the final say on their health care within country.xlsx", sheet = "sheet1")
response_ahcdm <- c(
  "Woman alone",
  "Woman and husband/partner",
  "Woman and someone else",
  "Husband/partner",
  "Family elders/relatives"
)
# Reshape to long format for plotting
ahcdm_long <- dfahcdm %>%
  select(country, all_of(response_ahcdm)) %>%
  pivot_longer(
    cols = -country,
    names_to = "Response",
    values_to = "Percent"
  ) %>%
  mutate(Response = gsub(" %", "", Response)) # Clean up response label
# Arrange bars in ascending order of total percent
ahcdmcountry_order <- ahcdm_long %>%
  group_by(country) %>%
  summarize(total_percent = sum(Percent, na.rm = TRUE)) %>%
  arrange(total_percent) %>%
  pull(country)
# Set country factor levels according to ascending total percent
# (assigned back to ahcdm_long so that the ordering is used in the plot)
ahcdm_long <- ahcdm_long %>%
  mutate(country = factor(country, levels = ahcdmcountry_order))
# Generate interactive plot using Plotly
fig1_ahcdm <- plot_ly(
  ahcdm_long,
  y = ~country,
  x = ~Percent,
  color = ~Response,
  type = "bar",
  orientation = "h"
) %>%
  layout(
    barmode = "stack",
    title = "Percentage of Women Who Have the Final Say on Their Health Care within Country",
    xaxis = list(title = " "),
    yaxis = list(title = " "),
    legend = list(title = list(text = "Response Category"))
  )
fig1_ahcdm
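The three plotting sections repeat the same reshape-and-order steps, so they could share one helper. The sketch below is a possible refactoring, not the code used for the report: the function name `to_long_ordered()` and the toy table are assumptions. It also orders countries by the share of one chosen response category rather than by the total, since the totals may be close to 100 for every country; that is a swapped-in choice for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: reshape a wide crosstab table to long format and
# order countries by the percentage in one chosen response category.
to_long_ordered <- function(df, response_cols, order_by) {
  long <- df %>%
    select(country, all_of(response_cols)) %>%
    pivot_longer(cols = -country, names_to = "Response", values_to = "Percent")
  country_order <- long %>%
    filter(Response == order_by) %>%
    arrange(Percent) %>%
    pull(country)
  long %>% mutate(country = factor(country, levels = country_order))
}

# Toy values for illustration only; not survey results
toy <- data.frame(
  country = c("A", "B", "C"),
  yes = c(80, 10, 45),
  no = c(20, 90, 55)
)
toy_long <- to_long_ordered(toy, c("no", "yes"), order_by = "yes")
levels(toy_long$country)
```

The returned data frame can be passed to `plot_ly()` in the same way as the long tables built in the sections above.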
Data and output files are saved to a OneDrive folder with the following structure:
DHS-Download Task
├── [DHS_Downloads]
└── [Downloads report, metadata, log]
[Gender Inequalities]
├── [DHS]
│   ├── [dhs-ir-piolt-merge-KE8_TZ8]
│   └── [planning-and-var-map]
└── [IPUMS]
    ├── [ipums-analysis]
    │   ├── [spss-analysis]
    │   └── [r-project-files-exec-report]
    ├── [ipums-data-extracts-comd-files]
    ├── [ipums-ir-dataset]
    └── [ipums-planning-and-var-map]
The DHS Program. (2025). Sustainable Development Goals. https://dhsprogram.com/topics/sdgs/index.cfm (Accessed August 28, 2025)
The DHS Program. (2025). Merging datasets. https://dhsprogram.com/data/Merging-datasets.cfm (Accessed September 1, 2025)
Wessells, M. G., & Kostelny, K. (2022). The psychosocial impacts of intimate partner violence against women in LMIC contexts: Toward a holistic approach. International Journal of Environmental Research and Public Health, 19(21), 14488. https://doi.org/10.3390/ijerph192114488
*SPSS
* Encoding: UTF-8.
*SPSS Version 30.0.0.0(172)
*Check the Recode file to confirm that variable names and contexts match. For this pilot merge, KEIR8CFL.SAV and TZIR82FL.SAV were conducted in the same year and survey phase (1st survey conducted in DHS Phase 8, in 2022).
*KEIR8CFL.SAV, however, is a continuous DHS dataset. Create a copy of the original dataset, as these changes will overwrite the original dataset unless otherwise specified, as in Step 2.
*Step 1: Create a unique ID using the V000 and CASEID variables from both files, to merge from Dataset 1 (KEIR8CFL.SAV).
*Unique ID for Kenya; Dataset 1( KEIR8CFL.SAV ).
DATASET ACTIVATE DataSet1.
STRING UCASEID (A20).
COMPUTE UCASEID=CONCAT(V000,CASEID).
VARIABLE LABELS UCASEID 'Unique Case ID'.
EXECUTE.
*Unique ID for Tanzania; Dataset 2( TZIR82FL.SAV ).
DATASET ACTIVATE DataSet2.
STRING UCASEID (A20).
COMPUTE UCASEID=CONCAT(V000,CASEID).
VARIABLE LABELS UCASEID 'Unique Case ID'.
EXECUTE.
*Step 2: Select Unique case ID along with IPV variables from both datasets for merging. Save them with a different name. Modify file path.
DATASET ACTIVATE DataSet1.
SAVE OUTFILE='C:\Users\Desktop\_KEIR8CFL.SAV'
  /KEEP UCASEID V000 V001 V003 V004 V005 V006 V007 G100 G101 G102 G103 G104 G105 G107.
DATASET ACTIVATE DataSet2.
SAVE OUTFILE='C:\Users\Desktop\_TZIR82FL.SAV'
  /KEEP UCASEID V000 V001 V003 V004 V005 V006 V007 G100 G101 G102 G103 G104 G105 G107.
*Open _KEIR8CFL.SAV and _TZIR82FL.SAV as Datasets 3 and 4 respectively.
*Step 3: Merge all variables.
DATASET ACTIVATE DataSet3.
ADD FILES /FILE=*
/FILE='DataSet4'.
EXECUTE.
*By default, the active dataset (Dataset3 _KEIR8CFL.SAV) is modified to contain the merged cases from the other dataset (Dataset4 _TZIR82FL.SAV).
SAVE OUTFILE='C:\Users\Desktop\KE8-TZ8-ir-ipv.SAV'
/COMPRESSED.
* Encoding: UTF-8.
*Version 29.0.2.0 (20).
*Naming conventions for cross-tabulation results used in further analysis:
  1. ipv: percentage of women slapped in last 12 months (frequency), variable code = (DVPSLAPFQ)
  2. fgm: percentage of ever circumcised women within country, variable code = (FCCIRC)
  3. ahcdm: percentage of women who have the final say on their health care within country, variable code = (DECFEMHCARE).
*Load dataset.
GET
  FILE='C:\Users\Desktop\ipums-ir-dataset.sav'.
DATASET ACTIVATE DataSet1.
CROSSTABS
  /TABLES=COUNTRY BY DVPSLAPFQ
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT ROW COLUMN
  /COUNT ROUND CELL.
CROSSTABS
  /TABLES=COUNTRY BY FCCIRC
  /FORMAT=AVALUE TABLES
  /CELLS=COLUMN
  /COUNT ROUND CELL.
CROSSTABS
  /TABLES=COUNTRY BY DECFEMHCARE
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT ROW COLUMN
  /COUNT ROUND CELL.
*-----------------------------------------------------------------------.
*For data cleaning in Excel:
  1. Remove:
     First 3 row headings
  2. Name col1: country
  3. Populate country column
  4. Filter and remove:
     - All row/col Totals
     - cols:
         Not in Universe col
         Missing
     - rows: in count/% col
         Blank
         All rows except % within country (for fgm and ahcdm)
  5. Number format is Percentage.